quick pure perl question

on 28.06.2009 17:41:44 by aw

Hi.
Out of curiosity, and just in case anyone knows off-hand :

perl 5.8.8

In a script, I essentially do this :

open(FIRST,'<:utf8',$name1);
open(SECOND,'>:raw',$name2);
while(defined($line = <FIRST>)) {
print SECOND $line;
}

and I get warnings : "Wide character in print ..."

I mean, I know that my data is UTF-8, and I know that some characters
are going to be "wide", and that's how I want them.
I also know that I could specify the output I/O layer as 'utf8' (which
avoids the warning).
But why do I get warnings when I specified 'raw' as the I/O layer ?
Doesn't 'raw' mean like 'as is' ?

Re: quick pure perl question

on 28.06.2009 18:12:22 by Mike OK

Check out this man page: http://perldoc.perl.org/functions/open.html
For UTF-8 encoding, the example is

open(FH, "<:encoding(UTF-8)", "file")

Mike


Re: quick pure perl question

on 28.06.2009 18:33:28 by Bill Moseley

On Sun, Jun 28, 2009 at 8:41 AM, André Warnier wrote:
> Hi.
> By curiosity, and just in case anyone knows off-hand :
>
> perl 5.8.8
>
> In a script, I substantially do this :
>
> open(FIRST,'<:utf8',$name1);
> open(SECOND,'>:raw',$name2);
> while(defined($line = <FIRST>)) {
> print SECOND $line;
> }
>
> and I get warnings : "Wide character in print ..."
>
> I mean, I know that my data is UTF-8, and I know that some characters are
> going to be "wide", and that's how I want them.
> I also know that I could specify the output I/O layer as 'utf8' (which
> avoids the warning).
> But why do I get warnings when I specified 'raw' as the I/O layer ?
> Doesn't 'raw' mean like 'as is' ?

You are decoding into characters when reading in. Perl sets the utf8
flag on $line to indicate that $line is character data. Then you are
attempting to write characters (which is an abstraction) out as byte
data. Perl warns you that you are doing this because the utf8 flag is
set.

You need to encode the character data before writing back out either
by encoding explicitly or using a layer.
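
As a minimal sketch of those two options (not code from the original script): the sample string, the temporary directory and the helper slurp_bytes are made up for illustration. Encode::encode and the :encoding(UTF-8) layer are standard Perl. The sample string has no newline, so that a platform :crlf layer cannot make the two files differ.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempdir);

# Hypothetical sample data: a character string containing one wide
# character (U+263A), written to a throwaway directory.
my $dir  = tempdir(CLEANUP => 1);
my $text = "caf\x{e9} \x{263A}";

# Option 1: encode explicitly, then write the resulting bytes through :raw.
open(my $out1, '>:raw', "$dir/explicit.txt") or die "open: $!";
print {$out1} encode('UTF-8', $text);
close($out1) or die "close: $!";

# Option 2: let an encoding layer do the conversion on output.
open(my $out2, '>:encoding(UTF-8)', "$dir/layer.txt") or die "open: $!";
print {$out2} $text;
close($out2) or die "close: $!";

# Illustrative helper: read a file back as raw bytes for comparison.
sub slurp_bytes {
    my ($path) = @_;
    open(my $in, '<:raw', $path) or die "open: $!";
    local $/;                      # slurp mode
    my $bytes = <$in>;
    close($in);
    return $bytes;
}

my $explicit = slurp_bytes("$dir/explicit.txt");
my $layered  = slurp_bytes("$dir/layer.txt");
print "same bytes on disk\n" if $explicit eq $layered;
```

Either way the file ends up holding the UTF-8 bytes, and neither print triggers the warning.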



--
Bill Moseley
moseley@hank.org

Re: quick pure perl question

on 30.06.2009 14:17:52 by Brock Diegel

On 28 Jun 2009, at 17:33, Bill Moseley wrote:
> You need to encode the character data before writing back out either
> by encoding explicitly or using a layer.


Or possibly not decode it in the first place and treat it as an opaque
octet stream. All depending, of course, on what it is you're trying to
achieve.
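
A rough sketch of that approach, assuming the goal is a plain copy: both handles use :raw, so the lines never become character strings. The temporary files and the seeded input are invented for the example. $name1 and $name2 mirror the names in the original question.

```perl
use strict;
use warnings FATAL => 'all';   # a "Wide character" warning would abort the run
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);
my ($name1, $name2) = ("$dir/in.txt", "$dir/out.txt");

# Seed a small UTF-8 input file from literal bytes ("caf", C3 A9, newline).
open(my $seed, '>:raw', $name1) or die "open: $!";
print {$seed} "caf\xC3\xA9\n";
close($seed) or die "close: $!";

# Copy it without ever decoding: bytes in, bytes out.
open(my $first,  '<:raw', $name1) or die "open: $!";
open(my $second, '>:raw', $name2) or die "open: $!";
while (defined(my $line = <$first>)) {
    print {$second} $line;
}
close($first);
close($second) or die "close: $!";

my $copied_size = -s $name2;
print "copied $copied_size bytes\n";
```

Because $line only ever holds octets here, print has nothing wide to complain about.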

--
Andy Armstrong, Hexten

Re: quick pure perl question

on 30.06.2009 15:13:26 by aw

Andy Armstrong wrote:
> On 28 Jun 2009, at 17:33, Bill Moseley wrote:
>> You need to encode the character data before writing back out either
>> by encoding explicitly or using a layer.
>
>
> Or possibly not decode it in the first place and treat it as an opaque
> octet stream. All depending, of course, on what it is you're trying to
> achieve.
>

I was not trying to achieve anything, and I do understand the
encoding/decoding aspect.
Basically, by using the '>:raw' encoding for the output stream, I was
not expecting perl to warn me that I was (knowingly) outputting "wide
characters" there, so I was surprised at the warning.

I /would/ have expected it if I was /not/ specifying an encoding, like
using simply '>'. But not when I am explicitly specifying '>:raw',
which in my mind, and according to my interpretation of the on-line
documentation, is equivalent to saying "output whatever you have as
bytes in that string variable right now, as is, I know what I'm doing".
But I guess my interpretation of the documentation is incorrect then.

Re: quick pure perl question

on 30.06.2009 15:33:22 by Brock Diegel

On 30 Jun 2009, at 14:13, André Warnier wrote:
> I /would/ have expected it if I was /not/ specifying an encoding,
> like using simply '>'. But not when I am explicitly specifying
> '>:raw', which in my mind, and according to my interpretation of the
> on-line documentation, is equivalent to saying "output whatever you
> have as bytes in that string variable right now, as is, I know what
> I'm doing".


You have that bit right - but the string doesn't contain bytes[1] - it
contains characters. Strings can either be an octet stream or a stream
of wide characters. By reading utf8 into a string you've turned it
into the latter. Perl's warning that you're pushing character data
into an octet hole.

[1] of course it's /made/ of bytes but that's not how Perl sees it.
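
To make the octets-versus-characters distinction concrete, here is a small illustrative sketch (the sample word is made up). It compares length() on the same data before and after decoding, and prints what utf8::is_utf8 reports for the decoded string.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $octets = "caf\xC3\xA9";              # 5 bytes: the UTF-8 encoding of "cafe" with e-acute
my $chars  = decode('UTF-8', $octets);   # the same text as a character string

my $octet_len = length($octets);         # counts bytes
my $char_len  = length($chars);          # counts characters

printf "octets: %d, characters: %d\n", $octet_len, $char_len;
printf "utf8 flag on decoded string: %s\n", utf8::is_utf8($chars) ? 'on' : 'off';

# Encoding the characters again gives back the original octets.
my $round_trip = encode('UTF-8', $chars);
print "round trip ok\n" if $round_trip eq $octets;
```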

--
Andy Armstrong, Hexten

Re: quick pure perl question

on 30.06.2009 15:45:15 by Bill Moseley

On Tue, Jun 30, 2009 at 6:13 AM, André Warnier wrote:

> Basically, by using the '>:raw' encoding for the output stream, I was not
> expecting perl to warn me that I was (knowingly) outputting "wide
> characters" there, so I was surprised at the warning.
>
> I /would/ have expected it if I was /not/ specifying an encoding, like
> using simply '>'. But not when I am explicitly specifying '>:raw', which in
> my mind, and according to my interpretation of the on-line documentation, is
> equivalent to saying "output whatever you have as bytes in that string
> variable right now, as is, I know what I'm doing".


I think it's because it's not bytes. Well, technically it's bytes of
course, but conceptually once you decode bytes you no longer have bytes.
You have that abstract idea of characters. And the only way to output that
information into a file (which holds bytes) is by first converting it to
bytes, and that requires encoding.

It's just like a thought you have in your brain. I'm not aware of any way
(yet) to output that in raw format -- it must be encoded into typed, spoken,
or signed language first. Even if most of what I write would be considered
pretty raw.

Isn't :raw mostly a way to use layers to say don't do CRLF conversion --
like the old use of binmode()? Oh, maybe not according to the docs.
It's best to decode and encode all character data at program boundaries and
stay away from Windows.
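
For what it's worth, here is a small sketch of what a :raw handle does when handed a wide character. The file name and the warning-capturing handler are invented for the example. As far as I understand the behaviour, Perl still writes something (the UTF-8 bytes of the string) but warns, because no layer told it how to turn characters into bytes.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

open(my $out, '>:raw', "$dir/raw.txt") or die "open: $!";
print {$out} "\x{263A}";        # one character above 0xFF, no encoding layer
close($out) or die "close: $!";

my $size = -s "$dir/raw.txt";
print "warning seen: $warnings[0]" if @warnings;
print "bytes written: $size\n";
```

So :raw does not mean "encode silently". It means no layer is doing any encoding, which is exactly why the warning appears.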


--
Bill Moseley
moseley@hank.org
